Randy Johnson
6/21/2018
```r
# some made up data
linExample <- data_frame(x1 = rnorm(100),
                         x2 = rnorm(100),
                         y  = x1 + x2^2 + rnorm(100))

mvnExample <- data_frame(x1 = rnorm(100),
                         y1 = x1 + rnorm(100), # normal error terms
                         y2 = x1 + rt(100, 3)) # non-normal error terms
```

```r
## 
##  Shapiro-Wilk normality test
## 
## data:  .std.resid
## W = 0.98544, p-value = 0.3417
## 
##  Shapiro-Wilk normality test
## 
## data:  .std.resid
## W = 0.91781, p-value = 1.079e-05
```
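The `.std.resid` column in the test output suggests the tests were run on standardized residuals. A minimal sketch of how they could be reproduced, assuming the residuals come from `broom::augment()`:

```r
library(broom)
library(dplyr)

# fit each model, extract standardized residuals, and test them for normality
lm(y1 ~ x1, data = mvnExample) %>% augment() %>% pull(.std.resid) %>% shapiro.test()
lm(y2 ~ x1, data = mvnExample) %>% augment() %>% pull(.std.resid) %>% shapiro.test()
```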
Model 1 has normal error terms.
```r
## 
## Call:
## lm(formula = y1 ~ x1, data = mvnExample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7996 -0.6230  0.1467  0.7049  1.9721 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.15329    0.09750  -1.572    0.119    
## x1           0.91487    0.09161   9.987   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9619 on 98 degrees of freedom
## Multiple R-squared:  0.5044, Adjusted R-squared:  0.4993 
## F-statistic: 99.74 on 1 and 98 DF,  p-value: < 2.2e-16
```
Model 2 has non-normal error terms.
```r
## 
## Call:
## lm(formula = y2 ~ x1, data = mvnExample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4554 -1.1273 -0.1941  1.0091  8.7573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1716     0.1932   0.888    0.377    
## x1            0.8026     0.1815   4.421 2.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.906 on 98 degrees of freedom
## Multiple R-squared:  0.1663, Adjusted R-squared:  0.1578 
## F-statistic: 19.55 on 1 and 98 DF,  p-value: 2.539e-05
```
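A quick visual companion to the Shapiro-Wilk tests is a Q-Q plot of each fit (a sketch, assuming the car package is available):

```r
library(car)

qqPlot(lm(y1 ~ x1, data = mvnExample)) # points hug the line: normal errors
qqPlot(lm(y2 ~ x1, data = mvnExample)) # heavy t-distributed tails stray from the envelope
```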
```r
mcol <- data_frame(x1 = rnorm(100),
                   x2 = x1 + rnorm(100), # x1 and x2 are correlated
                   x3 = rnorm(100),
                   y  = x1 + x2 + x3 + rnorm(100))

# three models
model1 <- lm(y ~ x1 + x2 + x3, data = mcol) # everything
model2 <- lm(y ~ x1 + x3,      data = mcol) # without x2
model3 <- lm(y ~ x2 + x3,      data = mcol) # without x1

# check for multicollinearity
vif(model1)
```

```r
##       x1       x2       x3 
## 2.271759 2.271338 1.005520
```

```r
vif(model2)
```

```r
##       x1       x3 
## 1.000791 1.000791
```

```r
vif(model3)
```

```r
##       x2       x3 
## 1.000606 1.000606
```
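Each VIF is 1 / (1 − R²), where R² comes from regressing that predictor on the remaining predictors, so the values above are easy to verify by hand:

```r
# VIF for x1 in model1, computed manually
r2 <- summary(lm(x1 ~ x2 + x3, data = mcol))$r.squared
1 / (1 - r2) # should match vif(model1)["x1"]
```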
| model | x1 | x2 | x3 | r2 |
|---|---|---|---|---|
| 1 | 0.94 | 1.02 | 0.98 | 0.87 |
| 2 | 1.88 | — | 0.93 | 0.76 |
| 3 | — | 1.59 | 1.03 | 0.81 |
Model coefficients for our three models. Notice how the coefficient for x3 doesn't change much, but the coefficients for x1 and x2, and the R², change quite a bit.
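A table like this can be assembled directly from the fitted models (a sketch, using the model objects defined above):

```r
# coefficients and R-squared for each of the three fits
lapply(list(model1, model2, model3), coef)
sapply(list(model1, model2, model3), function(m) summary(m)$r.squared)
```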
Adding a loess smoother highlights this periodic pattern in the data.
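The smoother itself is one line of ggplot2 (a sketch; `dat` and the variable names are assumed from the autocorrelation example below):

```r
library(ggplot2)

# scatterplot with a loess smoother to reveal the periodic pattern
ggplot(dat, aes(x = x1, y = y1)) +
  geom_point() +
  geom_smooth(method = "loess")
```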
```r
lm(y1 ~ x1, data = dat) %>%
  durbinWatsonTest()
```

```r
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1940815      1.609687   0.002
##  Alternative hypothesis: rho != 0
```
y1 ~ x1

```r
## 
## Suggested power transformation: 1.030183
```

```r
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.05980537    Df = 1     p = 0.8068038
```

y2 ~ x1

```r
## 
## Suggested power transformation: 0.1781482
```

```r
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 42.36333    Df = 1     p = 7.579794e-11
```
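These outputs come from car's heteroscedasticity tools; a sketch of the calls that would produce them (the model and data names are assumed):

```r
library(car)

m1 <- lm(y1 ~ x1, data = dat)
ncvTest(m1)          # score test for non-constant error variance
spreadLevelPlot(m1)  # also prints the suggested power transformation
```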
```r
## 
## Call:
## lm(formula = y1 ~ x1, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86192 -0.55272 -0.04338  0.54027  2.54020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.04454    0.20709   0.215  0.82991   
## x1           0.21653    0.06599   3.281  0.00122 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9762 on 198 degrees of freedom
## Multiple R-squared:  0.05158, Adjusted R-squared:  0.04679 
## F-statistic: 10.77 on 1 and 198 DF,  p-value: 0.00122
```

```r
## 
## Call:
## lm(formula = y2 ~ x1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2229 -1.1031 -0.2710  0.8684 10.1981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3360     0.4025   0.835    0.405    
## x1            0.6483     0.1282   5.055 9.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.897 on 198 degrees of freedom
## Multiple R-squared:  0.1143, Adjusted R-squared:  0.1098 
## F-statistic: 25.56 on 1 and 198 DF,  p-value: 9.755e-07
```
| Assumption | Cause | Consequence | Diagnosis |
|---|---|---|---|
| Linear Relationship | Incorrect model | Inaccurate predictions | car::crPlots() |
| Multivariate Normality | Incorrect model / Noisy data | Incorrect statistics (p-values / CIs) | car::qqPlot(), shapiro.test(residuals) |
| No/Little Multicollinearity | Correlated variables | Unstable model coefficients | car::vif() |
| No Autocorrelation | Non-independent data | Inefficient estimators | car::durbinWatsonTest() |
| Homoscedasticity | Incorrect model / "Bad" data | Incorrect statistics (p-values / CIs) | car::ncvTest(), car::spreadLevelPlot() |
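Most of these checks are also covered at a glance by base R's default diagnostic panel (a sketch):

```r
model <- lm(y1 ~ x1, data = mvnExample)
par(mfrow = c(2, 2))
plot(model) # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
```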
```r
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferonni p
## 20 3.427632           0.001312     0.062977
```

```r
##      StudRes        Hat      CookD
## 20 3.4276315 0.02443305 0.11925621
## 27 0.5655804 0.14003561 0.02643539
```

```r
##    rstudent unadjusted p-value Bonferonni p
## 49 6.307264         1.0012e-07   4.9057e-06
```

```r
##     StudRes       Hat    CookD
## 49 6.307264 0.1480318 1.893596
```
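These are printed results from car's outlier and influence diagnostics; the calls would look something like this (the model names are assumed):

```r
library(car)

outlierTest(model1)   # studentized residuals with Bonferroni-adjusted p-values
influencePlot(model1) # hat values vs studentized residuals, sized by Cook's distance
```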
Many of the problems we encounter in linear regression stem from model choice.
The two most common transformations:
Original model: y = β₀ + β₁x + ε
With power transformation: y = β₀ + β₁x² + ε

Original model: y = β₀ + β₁x + ε
With power transformation: y = β₀ + β₁√x + ε

Original model: y = β₀ + β₁x + ε
With log transformation: y = β₀ + β₁ ln(x) + ε

Original model: y = β₀ + β₁x + ε
With log transformation: ln(y) = β₀ + β₁x + ε
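In R these transformations fit directly in the model formula (a sketch; `dat`, `x`, and `y` are placeholder names):

```r
lm(y ~ I(x^2),  data = dat) # power transformation of the predictor
lm(y ~ sqrt(x), data = dat) # square-root transformation
lm(y ~ log(x),  data = dat) # log-transformed predictor
lm(log(y) ~ x,  data = dat) # log-transformed response
```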